Skip to content

fix(memory_tree,sync_status,scripts): IMMEDIATE-tx ingest, reembed skip-persistence, sidecar-based sync-status accounting, Windows dev-script PATH#2349

Merged
M3gA-Mind merged 8 commits into
tinyhumansai:mainfrom
sanil-23:fix/memory-tree-and-win-dev
May 21, 2026
Merged

fix(memory_tree,sync_status,scripts): IMMEDIATE-tx ingest, reembed skip-persistence, sidecar-based sync-status accounting, Windows dev-script PATH#2349
M3gA-Mind merged 8 commits into
tinyhumansai:mainfrom
sanil-23:fix/memory-tree-and-win-dev

Conversation

@sanil-23
Copy link
Copy Markdown
Contributor

@sanil-23 sanil-23 commented May 20, 2026

Summary

  • memory_tree::ingest::persist() uses an IMMEDIATE-tx for the read-then-write transaction. The previous DEFERRED tx bypassed the 15 s busy_timeout on lock upgrades (SQLite returns SQLITE_BUSY immediately for read→write upgrade conflicts, by design, to avoid deadlocks), so every Gmail per-message ingest failed under worker contention. Live evidence: 13/13 fails → 0/N on the verification run.
  • reembed_backfill runaway loop fixed: terminal-skip rows (body file missing on disk, embed wrong-dim, embed err) are now persistently tombstoned in two new *_reembed_skipped tables; the worklist/probe queries (handler batch + has_uncovered runtime + migration) exclude them via NOT EXISTS. Live verification on a dev workspace: 308 orphan chunks each attempted exactly once, chain complete reached in ~2 min — replaces a runaway that previously generated ~128 k body read failed; skipping warns across ~8 k batch defers for ~16 stuck chunks because the "skipping" was log-only with no persistent exit gate.
  • scripts/run-dev-win.sh — two Windows-contributor bugfixes for pnpm dev:app:win: (a) prepend dirname($PNPM_EXE) to PATH so Tauri's cmd /S /C "pnpm run dev" can resolve bare pnpm; (b) deduplicate PATH right before tauri dev so the MSYS→Windows env-conversion doesn't overflow the process-env block (without dedup, the spawned cmd inherits an EMPTY PATH).

Problem

Observed in production-mode use on Windows over a long debugging session:

  1. Gmail per-message ingest persists 0 messages under worker contention — every per-message ingest_email write fails with database is locked. The connection has busy_timeout(15s) configured, but the failures arrive in milliseconds, not seconds. Cause: a DEFERRED transaction takes a read lock then tries to upgrade to write on the first INSERT; under contention with the memory-tree worker pool, SQLite returns SQLITE_BUSY immediately for the lock upgrade and does NOT invoke the busy handler (documented deadlock-avoidance behaviour). The 15 s busy_timeout is bypassed; ingest fails fast every time; composio_sync never persists anything.
  2. reembed_backfill zombie loop — runaway re-selection of orphan chunks that can never embed (their body files are missing on disk, e.g. after a workspace path move). Generates ~128 k log warns / hour and burns one worker slot every 5 s indefinitely. Root cause: the *_embeddings tables only record successful embeddings; the worklist query WHERE NOT EXISTS embeddings keeps returning the same failing chunks because the "skipping" branch in the handler is log-only — no persistent state is written to exit the worklist.
  3. pnpm dev:app:win doesn't start cleanly on WindowsbeforeDevCommand dies with 'pnpm' is not recognized. Two underlying causes: (a) pnpm's install directory was never added to PATH (the script otherwise only calls pnpm by absolute path); (b) the bash-side PATH stacks the system PATH ~5× over (vcvars rebuild + /etc/profile re-runs + pnpm .bin layering), and the MSYS→Windows env-block conversion drops the entire PATH variable when it exceeds the limit — so the spawned cmd inherits an empty PATH (not "pnpm missing" — every Windows PATH entry missing, including %SystemRoot%\System32, so even where returns "not recognized").

Solution

Concern File(s) What
Gmail "database is locked" memory_tree/ingest.rs:216 rusqlite::Transaction::new_unchecked(conn, TransactionBehavior::Immediate) instead of conn.unchecked_transaction() (deferred). Locks at BEGIN where the busy handler applies; writers serialize and wait instead of failing fast.
Reembed runaway loop memory_tree/store.rs, tree_source/store.rs, jobs/mod.rs, jobs/handlers/mod.rs Two new idempotent CREATE TABLE IF NOT EXISTS rows (mem_tree_chunk_reembed_skipped, mem_tree_summary_reembed_skipped), two helpers (mark_chunk_reembed_skipped, mark_summary_reembed_skipped), three worklist/probe SQL queries get an AND NOT EXISTS … reembed_skipped clause (handler batch in handlers/mod.rs, the has_uncovered runtime probe in jobs/mod.rs, the has_uncovered migration probe in store.rs). Handler skip branches now also write a sentinel via let _ = mark_*_reembed_skipped(...) (best-effort: if the sentinel write itself fails, the row simply gets retried on the next batch — desirable fallback).
Windows dev script scripts/run-dev-win.sh (a) After PNPM_EXE is resolved, export PATH="$(dirname "$PNPM_EXE"):$PATH". (b) Right before "$PNPM_EXE" tauri dev, an in-script first-seen-wins dedup loop collapses PATH (entries are clean /c/... POSIX so :-split is safe). Adds two [run-dev-win] info-level log lines so the next contributor can see at a glance what happened (pnpm dir prepended to PATH: ..., PATH de-duplicated: 317 → 71 entries).

Design choices worth flagging for reviewers:

  • Why a separate tombstone table rather than NULL-sidecar rows for reembed-skip: the existing *_embeddings table has vector BLOB NOT NULL and dim INTEGER NOT NULL, so a NULL sentinel doesn't fit the schema, and an empty-blob/zero-dim sentinel would pollute the embedding readers (they'd need a special-case for "not a real embedding"). A separate small table keeps the embedding semantics clean and the worklist query's NOT EXISTS predicate trivial.
  • Why three SQL sites get the same NOT EXISTS clause rather than refactoring into one helper: the three are different shapes (handler batch returns a list with LIMIT ?; the runtime + migration probes are SELECT EXISTS(... OR ...)); abstracting them would either lose query-shape clarity or require a builder. The duplication is small and the in-place comments cross-reference each other, so future drift would be obvious.

Submission Checklist

  • N/A: tests deferred to a follow-up commit — see Validation Blocked below. The change is end-to-end verified on a live workspace (308 sentinel rows written + chain complete reached, 0 ingest lock failures observed in the new run) but unit-test additions for the new SQL paths are pending. The PR is opened as DRAFT for this reason. There IS an existing test reembed_backfill_repopulates_then_completes in handlers/mod.rs:1062 that already exercises the batch-defer-then-complete cycle — it should still pass with the new NOT EXISTS clause (the sentinel is only written on failure paths the test doesn't trigger), but I have not run it explicitly post-change.
  • N/A: diff coverage gate will not pass on this commit alone — same reason as above. Will land tests in follow-up before un-drafting.
  • N/A: no behaviour-only feature ID change — the IMMEDIATE-tx fix and the reembed-skip persistence are bugfixes on existing capabilities; no new top-level feature row required.
  • N/A: no matrix-tracked feature IDs touched (changes are SQL paths / Windows dev script).
  • No new external network dependencies — all changes are local SQLite or shell-script.
  • N/A: no release-cut surface touched. The Windows dev-script change is for the dev workflow only; the runtime behaviour of pnpm dev:app:win is unaffected for contributors whose machines were already correctly configured (the two fixes are additive — they only matter when pnpm's dir isn't otherwise on PATH and when PATH is otherwise overflowing).
  • N/A: no linked issue — the changes were derived from a long-running production-mode debugging session, not a tracked ticket. Happy to file follow-up issues if reviewers prefer.

Impact

Runtime / platform: desktop, all platforms for the Rust changes; the run-dev-win.sh change is Windows-only.

Performance: net positive. Gmail ingest no longer drops messages and serializes correctly; the reembed worker no longer wastes one slot every 5 s on the same orphan reads.

Migration / compatibility: two new SQLite tables added via the existing CREATE TABLE IF NOT EXISTS SCHEMA const path. No user_version bump; tables materialise lazily on next with_connection. Backward-compatible: older binaries opening the DB will simply ignore the new tables.

Security: no privilege changes.

Related

  • Closes:
  • Follow-up PR(s)/TODOs:
    • Unit tests for the IMMEDIATE-tx ingest path under simulated lock contention and for the reembed sentinel write/read cycle.
    • One-shot data cleanup or admin RPC for the ~308 sentinel rows now on the dev workspace (cosmetic — the rows are correct, just noisy in diagnostics).

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/memory-tree-and-win-dev
  • Commit SHA: 9b9f54ea

Validation Run

  • N/A: pnpm --filter openhuman-app format:check not run — Windows working tree has the documented ~600-file CRLF/LF drift; format:check would rewrite unrelated files. Pre-push hook bypassed via --no-verify per the same documented pattern.
  • N/A: pnpm typecheck not run — no TypeScript surface changed in this PR.
  • Focused tests: the existing reembed_backfill_repopulates_then_completes integration-style test in jobs/handlers/mod.rs:1062 should still pass with the new NOT EXISTS clause (it triggers the success path the sentinel is not written on); I have not run it explicitly post-change.
  • Rust fmt/check: cargo check ran implicitly via pnpm dev:app:win rebuild — 0 errors. cargo fmt not explicitly run; will run in the follow-up that adds tests.
  • N/A: Tauri fmt/check — no app/src-tauri/ Rust changes in this PR.

Validation Blocked

  • command: cargo fmt and unit-test additions for the IMMEDIATE-tx and reembed-skip paths
  • error: n/a — not blocked technically; bundled with the test-coverage follow-up before un-drafting.
  • impact: PR is opened as DRAFT; reviewer feedback welcome on the schema + worklist-SQL approach before I invest in the test suite.

Behavior Changes

  • Intended behavior change:
    1. Gmail per-message ingest no longer drops messages under worker contention — the IMMEDIATE-tx makes the 15 s busy_timeout actually apply on the read-then-write block in persist().
    2. reembed_backfill chain terminates instead of looping forever on orphan chunks — terminal failures are persistently tombstoned and excluded from the worklist.
    3. pnpm dev:app:win actually starts on a fresh Windows checkout without manual workarounds.
  • User-visible effect:
    • Gmail sync writes actually persist; "Memory Sources" counts grow instead of staying at zero.
    • Logs are no longer flooded with [memory_tree::jobs] reembed_backfill: chunk … body read failed; skipping (was ~128 k warns / run before, now once per orphan per signature).
    • New contributors on Windows can run pnpm dev:app:win and have a Tauri shell open.

Parity Contract

  • Legacy behavior preserved:
    • mem_tree_chunk_reembed_skipped / mem_tree_summary_reembed_skipped are created idempotently — older code opening the DB pre-this-change works unchanged; rolling back this commit removes only the additive tables.
    • The NOT EXISTS … reembed_skipped clauses are pure subtraction from the worklist — every chunk that previously appeared still appears unless it has a terminal-failure sentinel, which only the new handler code can write.
    • mark_*_reembed_skipped is best-effort (let _ =): a transient write failure does not crash the batch; the row simply gets retried on the next batch (desired fallback).
    • IMMEDIATE-tx is strictly safer than DEFERRED: it never deadlocks (acquires write lock at BEGIN), and busy_timeout already governs the wait. Existing successful writers under no contention behave identically.
    • run-dev-win.sh changes are additive and only fire conditions that already produced errors (pnpm-not-on-PATH; PATH overflow). Contributors whose machines didn't hit those conditions see no behaviour change.
  • Guard/fallback/dispatch parity checks: the worklist SQL was updated in all 3 sites that reference the same predicate (handler batch + has_uncovered runtime probe + has_uncovered migration probe) to avoid drift between sites; each unconditionally adds AND NOT EXISTS … reembed_skipped so the three remain semantically aligned.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none
  • Canonical PR: this one
  • Resolution: n/a

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Windows dev startup: improved PATH handling and de-duplication so the package manager is reliably found and startup logs PATH state.
    • Embedding backfill: unembeddable chunks/summaries are permanently marked per model signature and excluded from reprocessing, preventing infinite retry loops.
    • Sync status reporting: pending/processed counts now reflect migrated sidecars, fixing incorrect status totals.
    • Ingest stability: transaction behavior adjusted to reduce concurrency/busy-timeout issues during persistence.

Review Change Stack

…istence + Windows dev-script PATH fixes

memory_tree::ingest::persist() now opens an IMMEDIATE-tx for the per-message
read-then-write block. Under WAL with multiple writers (4 workers + scheduler
+ ingest), a DEFERRED tx takes a read lock then tries to upgrade to write —
and SQLite returns SQLITE_BUSY *immediately* for that upgrade, bypassing the
connection's 15s busy_timeout (deadlock-avoidance behaviour, documented).
Before this fix every Gmail per-message ingest_email failed with 'database
is locked', dropping all messages and stalling composio_sync. IMMEDIATE
acquires the write lock at BEGIN where busy_timeout DOES apply; writers
serialise and wait instead of failing fast. Live evidence: 13/13 fails -> 0/N.

memory_tree::jobs reembed_backfill now persistently tombstones chunks that
fail terminally (body file missing on disk, embed wrong-dim, embed err)
so the worklist query stops re-selecting them on every batch. Two new
idempotent CREATE TABLE IF NOT EXISTS rows: mem_tree_chunk_reembed_skipped
and mem_tree_summary_reembed_skipped. Two new writer helpers
(chunk_store::mark_chunk_reembed_skipped, tree_source::store::mark_summary_
reembed_skipped). Three worklist/probe SQL queries gain an AND NOT EXISTS
... reembed_skipped clause (handler batch in handlers/mod.rs, the
has_uncovered runtime probe in jobs/mod.rs, the has_uncovered migration
probe in store.rs). The handler's 'skipping' log branches now also write
a sentinel via let _ = mark_*_reembed_skipped(...). Live verification on a
dev workspace: 308 orphan chunks each attempted exactly once, sentinel-
marked, chain reached 'fully covered; chain complete' in ~2 min — replaces
a runaway that previously generated ~128k 'body read failed; skipping'
warns across ~8k batch defers for ~16 stuck chunks because the 'skipping'
was log-only with no persistent exit gate.

scripts/run-dev-win.sh — two Windows-contributor bugfixes for
'pnpm dev:app:win':

  (1) After PNPM_EXE is resolved, prepend dirname($PNPM_EXE) to PATH so
      Tauri's cmd /S /C 'pnpm run dev' beforeDevCommand can resolve bare
      'pnpm' from cmd.exe's PATH. The script otherwise only ever calls
      pnpm by absolute path, so its dir was never on PATH and Tauri died
      with ''pnpm' is not recognized as an internal or external command'.

  (2) Right before '"$PNPM_EXE" tauri dev', deduplicate PATH first-seen-
      wins so the MSYS->Windows env-conversion for the spawned native cmd
      doesn't overflow the process-env block limit. Without dedup, vcvars
      rebuild + Git-Bash /etc/profile re-runs + pnpm .bin layering stack
      the system PATH ~5x over (25k+ chars); MSYS conversion drops it and
      the child cmd inherits an EMPTY PATH (not 'pnpm' missing — *every*
      Windows PATH entry missing including %SystemRoot%\System32, so even
      'where' is 'not recognized'). Dedup collapses 317 -> 71 entries on
      this box, well under the env-block ceiling, and beforeDevCommand
      starts running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds persistent tombstones and helpers for terminal re-embed failures, filters tombstoned rows from worklists and coverage probes, updates ingest transaction to Immediate, and fixes Windows dev script PATH/pnpm resolution.

Changes

Windows Dev Script & Persistence Configuration

Layer / File(s) Summary
Windows PATH resolution for pnpm
scripts/run-dev-win.sh
Prepends the resolved pnpm directory to PATH and de-duplicates PATH entries immediately before running tauri dev so Tauri's beforeDevCommand can resolve pnpm by bare name and avoid PATH overflow issues.
Ingest transaction concurrency mode
src/openhuman/memory/tree/ingest.rs
Replaces unchecked_transaction with rusqlite::TransactionBehavior::Immediate in persist to establish explicit locking semantics and document busy-timeout behavior during writes.

Re-embed Backfill Terminal Failure Handling

Layer / File(s) Summary
Tombstone schema tables for terminal failures
src/openhuman/memory/tree/store.rs
Adds mem_tree_chunk_reembed_skipped and mem_tree_summary_reembed_skipped schema entries to persist terminal skip records keyed by `(chunk_id
Coverage probe treats tombstones as covered
src/openhuman/memory/tree/jobs/mod.rs, src/openhuman/memory/tree/store.rs
Updates the migration/coverage probe SQL to treat rows present in the _reembed_skipped tables as covered (adds NOT EXISTS checks) so they do not trigger backfill enqueueing.
Skip marker helper functions
src/openhuman/memory/tree/store.rs, src/openhuman/memory/tree/tree_source/store.rs
Exports mark_chunk_reembed_skipped and mark_summary_reembed_skipped which upsert tombstone rows with reason and current UTC timestamp for idempotent terminal-failure recording.
Worklist filtering excludes tombstoned rows
src/openhuman/memory/tree/jobs/handlers/mod.rs
handle_reembed_backfill selection queries for chunks and summaries add NOT EXISTS predicates against the tombstone tables for the active model_signature, preventing re-selection of already-tombstoned rows.
Error handling records terminal failures
src/openhuman/memory/tree/jobs/handlers/mod.rs
Embed-phase error paths persistently tombstone chunks/summaries (body-read missing, embed errors, pack/dimension mismatches) by calling the new skip helpers with reason strings instead of only logging.
Regression test
src/openhuman/memory/tree/jobs/handlers/mod.rs
Adds a test that removes a staged chunk body, asserts a single tombstone write, verifies the next handler pass returns Done, and confirms the coverage probe marks the space covered.

Sync Status RPC

Layer / File(s) Summary
Module docs and imports
src/openhuman/memory/sync_status/rpc.rs
Clarifies resolved/pending semantics post-#1574 and adds rusqlite::Connection import for the new helper.
RPC refactor to query helper
src/openhuman/memory/sync_status/rpc.rs
Computes now_ms once and delegates aggregation to query_sync_statuses(conn, now_ms) instead of inline logic.
Query helper that uses embedding sidecar
src/openhuman/memory/sync_status/rpc.rs
Introduces query_sync_statuses which uses EXISTS on mem_tree_chunk_embeddings and treats dropped/reembed-skipped chunks as resolved for pending/processed calculations.
Tests for sidecar-driven counts and no active wave
src/openhuman/memory/sync_status/rpc.rs
Adds tests covering sidecar-driven pending/processed counts, no active wave when fully embedded, and dropped/reembed-skipped chunks counted as resolved.

Sequence Diagram(s)

(omitted — changes are schema, query, and helper updates; no new multi-component sequential control flow requiring diagram)

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly Related PRs

Suggested reviewers

  • graycyrus
  • senamakel

Poem

🐰 I hopped through logs and schema beds,
and planted tiny tombstone threads.
One mark to stop the looping chase,
neat PATH fixes, pnpm in place.
Cheers from a rabbit in the code-glade!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title directly summarizes the three main coordinated bugfixes: IMMEDIATE-tx ingest, reembed skip-persistence, and Windows dev-script PATH fixes, matching the changeset's primary objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Pure formatting. CI Rust Quality (fmt) gate flagged my reembed-skip
mark_chunk_reembed_skipped call site for the 'embed wrong dim' branch
in handle_reembed_backfill: rustfmt prefers one argument per line for
that signature with the call's indentation context. No semantic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 marked this pull request as ready for review May 20, 2026 14:32
@sanil-23 sanil-23 requested a review from a team May 20, 2026 14:32
@coderabbitai coderabbitai Bot added the memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. label May 20, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
scripts/run-dev-win.sh (1)

497-498: 💤 Low value

Optional: Separate declaration and assignment to avoid masking errors.

Static analysis (SC2155) flags that combining export with command substitution masks the exit code of dirname. While dirname on a validated path is highly reliable, the pattern could be split for defensive consistency.

♻️ Proposed fix
-export PATH="$(dirname "$PNPM_EXE"):$PATH"
-echo "[run-dev-win] pnpm dir prepended to PATH: $(dirname "$PNPM_EXE")"
+PNPM_DIR="$(dirname "$PNPM_EXE")"
+export PATH="$PNPM_DIR:$PATH"
+echo "[run-dev-win] pnpm dir prepended to PATH: $PNPM_DIR"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/run-dev-win.sh` around lines 497 - 498, The export line currently
combines assignment and command substitution which masks dirname's exit code;
split it into two statements: first compute the dirname result into a variable
(e.g., pnpm_dir=$(dirname "$PNPM_EXE") and check for errors), then export
PATH="$pnpm_dir:$PATH" and echo using that variable. Update the lines around the
PATH assignment and the echo in scripts/run-dev-win.sh (references: PNPM_EXE,
dirname, PATH, and the echo "[run-dev-win] ...") so the dirname failure won't be
masked.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@scripts/run-dev-win.sh`:
- Around line 497-498: The export line currently combines assignment and command
substitution which masks dirname's exit code; split it into two statements:
first compute the dirname result into a variable (e.g., pnpm_dir=$(dirname
"$PNPM_EXE") and check for errors), then export PATH="$pnpm_dir:$PATH" and echo
using that variable. Update the lines around the PATH assignment and the echo in
scripts/run-dev-win.sh (references: PNPM_EXE, dirname, PATH, and the echo
"[run-dev-win] ...") so the dirname failure won't be masked.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e64773b6-d1a7-49bb-8ff4-0261e1abc3b7

📥 Commits

Reviewing files that changed from the base of the PR and between 41e7631 and 2b1670f.

📒 Files selected for processing (6)
  • scripts/run-dev-win.sh
  • src/openhuman/memory/tree/ingest.rs
  • src/openhuman/memory/tree/jobs/handlers/mod.rs
  • src/openhuman/memory/tree/jobs/mod.rs
  • src/openhuman/memory/tree/store.rs
  • src/openhuman/memory/tree/tree_source/store.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 20, 2026
Copy link
Copy Markdown
Contributor

@M3gA-Mind M3gA-Mind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — PR #2349: IMMEDIATE-tx ingest + reembed skip-persistence + Windows PATH fixes

Three real, production-observed bugs fixed here, and the root-cause analysis in the PR description is thorough and accurate. The IMMEDIATE-tx change is the right fix for the SQLite busy-handler bypass, the tombstone table design is clean (FK + ON DELETE CASCADE, idempotent upsert, index on signature), and the Windows PATH dedup logic is correct. Approach approved — one blocker before un-drafting.

CodeRabbit flagged SC2155 on the export PATH=$(dirname ...) line in run-dev-win.sh — skipped here.


[major] Tombstone exclusion path has no test coverage

The tombstone mechanism is the heart of the reembed fix, but no test exercises it. The existing reembed_backfill_repopulates_then_completes (line 1062) covers only the success path — it never triggers the failure branches where mark_chunk_reembed_skipped / mark_summary_reembed_skipped are called, so the critical new invariants are unverified:

  1. A terminal failure (body file missing, wrong embed dim) actually writes a tombstone row.
  2. The next batch's worklist NOT EXISTS … reembed_skipped clause excludes it.
  3. has_uncovered (in both jobs/mod.rs and the migration probe in store.rs) returns false when every remaining unembedded chunk/summary has a tombstone — i.e., the chain terminates instead of re-arming forever.

Without (3) in particular, a test regression could silently reintroduce the infinite-rearm loop. A dedicated test that seeds an orphan chunk, runs one backfill batch, asserts the tombstone row was written, runs the has_uncovered probe, and asserts it returns false would close this gap cleanly. The test harness already has reembed_backfill_repopulates_then_completes as a scaffold to follow.


[minor] Fully-qualified crate::... path for summary tombstone calls is asymmetric

See inline comment — chunk side uses the chunk_store:: alias, summary side spells out the full crate::openhuman::memory::tree::tree_source::store:: path. Just add a use import at the top of the handler.


[minor] No built-in path to un-tombstone a row

Once a (chunk_id, model_signature) is tombstoned, the only way to clear it is direct DELETE FROM mem_tree_chunk_reembed_skipped SQL — there's no admin RPC or helper. The PR mentions a follow-up cleanup script for the 308 dev-workspace rows. Worth filing the issue now so it doesn't get lost; if a contributor moves their workspace back the orphan chunks will silently stay skipped until tombstones are cleared manually.


Summary table

File Finding Severity
jobs/handlers/mod.rs No test for tombstone write / worklist exclusion / has_uncovered termination major
jobs/handlers/mod.rs Summary tombstone calls use fully-qualified crate::... path instead of a use import (chunk side uses chunk_store:: alias) minor
store.rs / tree_source/store.rs No un-tombstone helper; manually clear tombstones if workspace paths are repaired minor

The code logic itself is solid — no semantic bugs found. This is a test-coverage gate only.

Comment thread src/openhuman/memory/tree/jobs/handlers/mod.rs Outdated
…as, SC2155 fix

Addresses review on tinyhumansai#2349:

- **[M3gA-Mind: major]** Add `reembed_backfill_tombstones_orphan_and_terminates`
  test pinning the §6 invariants:
    (1) mark_chunk_reembed_skipped writes a row on body-read failure
    (2) next batch's worklist NOT EXISTS clause excludes the tombstoned row
    (3) ensure_reembed_backfill's has_uncovered probe reports covered
  Without (3) a regression could silently reintroduce the infinite-rearm
  loop (~128 k warns / ~8 k defers per 16 orphans, the original bug).
- **[M3gA-Mind: minor]** Switch the three `mark_summary_reembed_skipped`
  calls in `handle_reembed_backfill` to use a `summary_store::` alias,
  matching the existing `chunk_store::` pattern in the same file.
- **[CodeRabbit: SC2155]** Split `export PATH="$(dirname "$PNPM_EXE"):$PATH"`
  in `run-dev-win.sh` into a separate `PNPM_DIR=$(dirname …)` assignment
  + export, so a dirname failure isn't swallowed by the enclosing export.
- **[M3gA-Mind: minor]** Un-tombstone helper tracked in tinyhumansai#2358 — left as a
  follow-up per the review (sticky tombstones are intentional; the
  manual-operator interface is the right scope).

Validation:
- cargo test --lib openhuman::memory::tree → 598 passed (including the new test)
- cargo fmt --check → clean
- cargo clippy -p openhuman --no-deps → no new warnings on touched files

Co-Authored-By: Claude <noreply@anthropic.com>
@sanil-23
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review — all three points addressed in 09331d8.

[major] Tombstone exclusion path test coverage

Added reembed_backfill_tombstones_orphan_and_terminates in src/openhuman/memory/tree/jobs/handlers/mod.rs (right after the existing reembed_backfill_repopulates_then_completes scaffold you pointed at). It exercises the three invariants you called out, end-to-end:

  1. Seeds a chunk row + staged path, then fs::remove_file on the body to reproduce the in-wild orphan failure mode.
  2. Runs handle_reembed_backfill once → asserts JobOutcome::Defer and that mem_tree_chunk_reembed_skipped has exactly one row keyed by (chunk_id, sig).
  3. Runs the handler again → asserts JobOutcome::Done (the NOT EXISTS … reembed_skipped clause in the worklist excludes the tombstone).
  4. Runs the ensure_reembed_backfill migration probe inline → asserts has_uncovered = false, so the chain can't re-arm on the next config save.

That third assertion is the critical one — without it (3) the original 16-orphan → ~128 k warn → ~8 k defer runaway could silently regress and still pass the other two assertions.

[minor] summary_store alias

Added the use crate::openhuman::memory::tree::tree_source::store as summary_store; next to the existing chunk_store alias, and converted the three mark_summary_reembed_skipped calls in the reembed loop to use it. Kept the scope tight — the older fully-qualified tree_source::store:: calls at lines 465 / 486 / 531 / 745 in the same file are pre-existing (PR #2153) and out of this PR's diff, so I didn't sweep them up to keep the review-fix diff focused.

[minor] No path to un-tombstone

Filed #2358 to track. Scoped to a backend helper + optional admin RPC; explicitly noted the "sticky tombstone is intentional, manual operator action is the right interface" framing so we don't accidentally autocorrect them on the next PR.

CodeRabbit nit (SC2155)

While here, split export PATH="$(dirname "$PNPM_EXE"):$PATH" in scripts/run-dev-win.sh into a separate PNPM_DIR=$(…) + export PATH=… so a dirname failure surfaces with a real exit code instead of being eaten by the enclosing export. Single-line nit, fits the same review-fix commit.

Validation

  • cargo test --lib openhuman::memory::tree → 598 passed (including the new test)
  • cargo fmt --check → clean
  • cargo clippy -p openhuman --no-deps (matches the Rust Quality CI step) → no new warnings on touched files

@sanil-23 sanil-23 requested a review from M3gA-Mind May 20, 2026 15:17
@coderabbitai coderabbitai Bot added the bug label May 20, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/run-dev-win.sh`:
- Around line 501-503: PNPM_DIR may be "." when PNPM_EXE is not a path-like
value, which would prepend the current working directory to PATH; after
computing PNPM_DIR from PNPM_EXE, guard the export by checking PNPM_DIR is
non-empty and not "." (and optionally starts with "/" or contains a path
separator) before doing export PATH="$PNPM_DIR:$PATH" and echo; update the block
that sets PNPM_DIR and exports PATH (references: PNPM_EXE, PNPM_DIR, PATH) to
perform this conditional check and only prepend when the check passes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 442a8d46-edae-4716-9dc9-88f993e6bc84

📥 Commits

Reviewing files that changed from the base of the PR and between 2b1670f and 09331d8.

📒 Files selected for processing (2)
  • scripts/run-dev-win.sh
  • src/openhuman/memory/tree/jobs/handlers/mod.rs

Comment thread scripts/run-dev-win.sh Outdated
Addresses CodeRabbit's new actionable on the SC2155 fix from 09331d8:
if `PNPM_EXE` somehow resolves to a bare filename, `dirname` returns ".",
and prepending "." to PATH injects the current working directory — bad
posture on a Windows dev machine where cwd often holds untrusted artifacts.

Skip the prepend when PNPM_DIR is empty or ".", log that we did so, and
fall through. The absolute-path call sites elsewhere in this script still
work; only Tauri's beforeDevCommand needed bare-name pnpm resolution.

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 20, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 20, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by @graycyrus

Solid PR — all three fixes are well-motivated by production evidence and the implementation is clean.

Walkthrough

Area Files What
Rust core ingest.rs DEFERRED → IMMEDIATE tx — correct fix for SQLite busy-handler bypass on read→write lock upgrades
Rust core store.rs, tree_source/store.rs, handlers/mod.rs, jobs/mod.rs Tombstone tables + NOT EXISTS clauses across all 3 worklist/probe sites — properly synchronized
Scripts run-dev-win.sh PATH prepend guard + dedup loop — defensive and well-commented

Notes

  • IMMEDIATE tx is strictly safer than DEFERRED here — acquires write lock at BEGIN where busy_timeout applies. No deadlock risk since the tx is short-lived.
  • Tombstone design (separate tables vs NULL sidecars) is the right call — keeps *_embeddings schema clean and the NOT EXISTS predicates trivial.
  • All 3 SQL sites (handler batch worklist, has_uncovered runtime probe, migration probe) are properly updated with matching NOT EXISTS … reembed_skipped clauses. The in-code cross-references make future drift obvious.
  • Test reembed_backfill_tombstones_orphan_and_terminates pins the critical regression path: tombstone write → worklist exclusion → chain termination → migration probe agreement.
  • CodeRabbit's . guard concern was addressed in e48ee70 with a soft-warning approach — appropriate since absolute-path call sites still work in the degenerate case.

No critical or major findings — nice work. Pending: unit tests for the IMMEDIATE-tx path under simulated lock contention (acknowledged in PR description as follow-up before un-drafting).

graycyrus
graycyrus previously approved these changes May 20, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice work!

… not legacy inline column

memory_sync_status_list computed both lifetime `chunks_pending` and the
active-wave `batch_processed` from `mem_tree_chunks.embedding IS NULL`.
But the tinyhumansai#1574 §7 migration copied every vector into the per-(chunk,model)
`mem_tree_chunk_embeddings` sidecar and stopped writing the inline column,
so it reads back NULL for *every* chunk. The UI therefore showed 100% of
chunks as "pending" and "0 / N processed" forever, no matter how much of
the pipeline had actually drained (observed live: 676 gmail chunks all
"pending", "0 / 2 processed", while 706/727 chunks were embedded in the
sidecar and seals were progressing).

Switch the `provider_chunks` CTE to derive an `embedded` flag via
EXISTS against the sidecar (indexed by the chunk_id PK), and replace the
`embedding IS NULL` / `IS NOT NULL` predicates with `embedded = 0/1`. The
inline column is now never read here — consistent with it being a legacy
additive column that tinyhumansai#1574 abandoned.

Refactor: extract the query + row mapping into `query_sync_statuses(conn,
now_ms)` so it can be unit-tested against a tempdir-backed connection
without the async/spawn_blocking wrapper.

Tests:
- pending_and_processed_key_off_sidecar_not_inline_column: a chunk with a
  sidecar row counts as processed even though its inline column is NULL.
- fully_embedded_provider_reports_no_active_wave: drained provider yields
  batch_total = batch_processed = 0 (UI hides the bar).
- cargo test --lib sync_status::rpc::tests → 5 passed
- cargo fmt + cargo check --lib → clean (only pre-existing unrelated warns)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 dismissed stale reviews from graycyrus and coderabbitai[bot] via b873a6a May 21, 2026 07:14
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 21, 2026
… resolved, not perpetually pending

The pending/processed signal counted a chunk as resolved only if it had a
sidecar embedding row. But two terminal states have no sidecar row and never
will: admission-gate `dropped` chunks, and chunks tombstoned in
`mem_tree_chunk_reembed_skipped` (tinyhumansai#1574 §6: body missing on disk / wrong dim
/ embed error). Those read as "pending" forever and — since the wave anchor
is gated on `pending > 0` — could pin a provider's progress bar below 100%
even though no work is actually outstanding (e.g. the slack orphans:
1 dropped + 20 reembed-skipped showing as 21 perpetual "pending").

Rename the per-row `embedded` flag to `resolved`: true when the chunk has a
sidecar row (any signature) OR `lifecycle_status = 'dropped'` OR a
`mem_tree_chunk_reembed_skipped` tombstone. pending/processed now key off
`resolved`, so only genuinely-in-queue chunks (`pending_extraction`, no
terminal marker) count as pending.

Test: dropped_and_skipped_chunks_count_as_resolved_not_pending — 3 chunks
(dropped, skipped, genuinely-pending) → chunks_pending=1, batch_processed=2.
- cargo test --lib sync_status::rpc::tests → 6 passed
- cargo fmt + cargo check --lib → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/openhuman/memory/sync_status/rpc.rs (1)

148-160: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the wave anchor on unresolved chunks inside the window.

Lines 148-160 only require pp.pending > 0, so a provider with a stale unresolved chunk from an older wave and a fully resolved recent chunk still gets non-zero batch_total / batch_processed. That contradicts the contract in Lines 40-44 and can surface a bogus 100%-complete "active" wave. Require at least one unresolved row with created_at_ms >= m.max_created - ?1 before creating an anchor. A regression test for “old pending + recent resolved” would lock this down.

Suggested SQL adjustment
-         provider_pending AS ( \
-            SELECT provider, \
-                   SUM(CASE WHEN resolved = 0 THEN 1 ELSE 0 END) AS pending \
-            FROM provider_chunks \
-            GROUP BY provider \
-         ), \
+         provider_pending AS ( \
+            SELECT p.provider, \
+                   SUM(CASE WHEN p.resolved = 0 \
+                             AND p.created_at_ms >= m.max_created - ?1 \
+                            THEN 1 ELSE 0 END) AS pending \
+            FROM provider_chunks p \
+            JOIN provider_max m ON p.provider = m.provider \
+            GROUP BY p.provider \
+         ), \
          wave_anchors AS ( \
             SELECT p.provider, MIN(p.created_at_ms) AS anchor \
             FROM provider_chunks p \
             JOIN provider_max m ON p.provider = m.provider \
             JOIN provider_pending pp ON p.provider = pp.provider \
             WHERE pp.pending > 0 \
               AND p.created_at_ms >= m.max_created - ?1 \
             GROUP BY p.provider \
          ) \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/sync_status/rpc.rs` around lines 148 - 160, The
wave_anchors logic is using pp.pending > 0 which counts unresolved chunks across
all time and can surface a stale unresolved chunk as an "active" wave; change
the query to require an unresolved chunk inside the recent window by adding a
windowed unresolved count in provider_pending (e.g. SUM(CASE WHEN resolved = 0
AND created_at_ms >= m.max_created - ?1 THEN 1 ELSE 0 END) AS pending_window)
and then gate wave_anchors with pp.pending_window > 0 (or alternately replace
the pp.pending check with an EXISTS subquery against provider_chunks that looks
for resolved = 0 AND created_at_ms >= m.max_created - ?1); update references
(provider_pending, wave_anchors, provider_chunks, provider_max,
pp.pending_window, p.created_at_ms, m.max_created, ?1) accordingly and add a
regression test for “old pending + recent resolved.”
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/openhuman/memory/sync_status/rpc.rs`:
- Around line 148-160: The wave_anchors logic is using pp.pending > 0 which
counts unresolved chunks across all time and can surface a stale unresolved
chunk as an "active" wave; change the query to require an unresolved chunk
inside the recent window by adding a windowed unresolved count in
provider_pending (e.g. SUM(CASE WHEN resolved = 0 AND created_at_ms >=
m.max_created - ?1 THEN 1 ELSE 0 END) AS pending_window) and then gate
wave_anchors with pp.pending_window > 0 (or alternately replace the pp.pending
check with an EXISTS subquery against provider_chunks that looks for resolved =
0 AND created_at_ms >= m.max_created - ?1); update references (provider_pending,
wave_anchors, provider_chunks, provider_max, pp.pending_window, p.created_at_ms,
m.max_created, ?1) accordingly and add a regression test for “old pending +
recent resolved.”

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4083f9a-8afc-4b02-9b73-531f1efe4ea8

📥 Commits

Reviewing files that changed from the base of the PR and between b873a6a and 20cec10.

📒 Files selected for processing (1)
  • src/openhuman/memory/sync_status/rpc.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 21, 2026
@sanil-23 sanil-23 changed the title fix(memory_tree,scripts): IMMEDIATE-tx ingest + reembed skip-persistence + Windows dev-script PATH fixes fix(memory_tree,sync_status,scripts): IMMEDIATE-tx ingest, reembed skip-persistence, sidecar-based sync-status accounting, Windows dev-script PATH May 21, 2026
… the window (CodeRabbit)

Addresses CodeRabbit's "Major / quick win" on PR tinyhumansai#2349: `provider_pending`
counted `resolved = 0` across all time, so a provider with a stale unresolved
chunk from an older wave plus a fully-resolved recent chunk still produced a
wave anchor — a bogus "active" wave that contradicts the wave contract.

Scope `provider_pending.pending` to chunks within WAVE_WINDOW_MS of the
provider's most recent chunk (join provider_max, reuse the ?1 window param),
so wave_anchors only fires when there is genuinely-unresolved recent work.
Lifetime chunks_pending is computed separately in the outer SELECT and is
unaffected — old stragglers still show in the lifetime count.

Test: stale_out_of_window_pending_does_not_open_a_wave — old unresolved
(30 min ago) + recent resolved ⇒ chunks_pending=1, batch_total=0.
- cargo test --lib sync_status::rpc::tests → 7 passed
- cargo fmt + cargo check --lib → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23
Copy link
Copy Markdown
Contributor Author

@coderabbitai addressed in a45c5db.

The "gate the wave anchor on unresolved chunks inside the window" finding is fixed: provider_pending.pending is now scoped to chunks within WAVE_WINDOW_MS of the provider's most recent chunk (joins provider_max, reuses the ?1 window param), so a stale out-of-window straggler can't resurrect an "active" wave. Lifetime chunks_pending is unchanged (still counts old stragglers).

Added the regression test you suggested — stale_out_of_window_pending_does_not_open_a_wave: old unresolved (30 min ago) + recent resolved ⇒ chunks_pending=1, batch_total=0. Module suite: 7 passed.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Tip

For best results, initiate chat on the files or code changes.

@sanil-23 Thanks for the update — the fix looks exactly right. Scoping provider_pending.pending to chunks within WAVE_WINDOW_MS of provider_max means a single stale out-of-window straggler can no longer anchor an active wave, while chunks_pending (lifetime count) still surfaces it to the operator so it isn't silently dropped. The regression test covers the precise edge case: a stale unresolved chunk produces batch_total=0 rather than reopening the wave — that's the invariant that would have been easiest to regress silently. 7/7 passing is a clean close on this thread.

The "Rust Core Tests (Windows — secrets ACL)" matrix job was *cancelled*
(not a test failure) on the previous run — its "Run Windows-specific secrets
tests" step succeeded. Linux Rust core tests, clippy, and all coverage jobs
passed. Empty commit to re-fire the workflow; no code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@M3gA-Mind M3gA-Mind merged commit 6137b67 into tinyhumansai:main May 21, 2026
26 checks passed
MrMrVlad pushed a commit to MrMrVlad/openhuman that referenced this pull request May 21, 2026
…inyhumansai#2358)

Operators can remove chunk/summary tombstones after fixing moved workspaces
or embedder misconfig without hand-editing SQLite. Validates id/signature
inputs; bulk clear is scoped to one model_signature.

Depends on PR tinyhumansai#2349 (tombstone tables).

Co-authored-by: Cursor <cursoragent@cursor.com>
MrMrVlad pushed a commit to MrMrVlad/openhuman that referenced this pull request May 22, 2026
…inyhumansai#2358)

Operators can remove chunk/summary tombstones after fixing moved workspaces
or embedder misconfig without hand-editing SQLite. Validates id/signature
inputs; bulk clear is scoped to one model_signature.

Depends on PR tinyhumansai#2349 (tombstone tables).

Co-authored-by: Cursor <cursoragent@cursor.com>
CodeGhost21 pushed a commit to CodeGhost21/openhuman that referenced this pull request May 22, 2026
…ip-persistence, sidecar-based sync-status accounting, Windows dev-script PATH (tinyhumansai#2349)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sanil-23 <sanil@alphahuman.xyz>
MrMrVlad pushed a commit to MrMrVlad/openhuman that referenced this pull request May 22, 2026
…inyhumansai#2358)

Operators can remove chunk/summary tombstones after fixing moved workspaces
or embedder misconfig without hand-editing SQLite. Validates id/signature
inputs; bulk clear is scoped to one model_signature.

Depends on PR tinyhumansai#2349 (tombstone tables).

Co-authored-by: Cursor <cursoragent@cursor.com>
senamakel pushed a commit to aqilaziz/openhuman that referenced this pull request May 23, 2026
…ip-persistence, sidecar-based sync-status accounting, Windows dev-script PATH (tinyhumansai#2349)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sanil-23 <sanil@alphahuman.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants